Contact Info

Skills

Skilled in several programming languages: R, C, Perl, Bash

Worked extensively with out-of-memory representation of large-scale data e.g. HDF5

Experience with cloud deployments (OpenStack, Kubernetes) and container technologies (Docker, Singularity)

Heavily used HPC infrastructure (Slurm, LSF, Condor) and parallel processing tools (MPI, OpenMP)

CV last updated 20 Jan 2023.

Mike L. Smith

Senior Scientific Programmer

I am a research software engineer at the European Molecular Biology Laboratory.

Employed as part of the German Network for Bioinformatics Infrastructure (de.NBI), I develop and maintain a variety of tools for the analysis of biological data. In particular, I support many packages and tools for the Bioconductor project and its community, and am a member of both the Bioconductor core team and the Community Advisory Board.

I work closely with experimental scientists and other programmers to create robust, usable, and performant analysis tools. I enjoy using software to automate tasks in order to provide rapid deployment and feedback.

I am also passionate about good software practices for reproducible research. I work extensively with version control, containers, and literate programming tools like R Markdown, champion these practices to those around me, and have taught multiple courses on these topics.

Professional Experience

Senior Scientific Programmer

European Molecular Biology Laboratory

Heidelberg, DE

2019-

  • Developed and deployed Bioconductor Code Tools website. Using Docker containers and a Kubernetes deployment, this site automatically syncs with the central Bioconductor git repository and provides tools for browsing and searching the code behind all Bioconductor packages.
  • Designed and maintained Bioconductor GitHub Actions to simplify the use of GitHub Actions for BioC package development - grimbough/bioc-actions
  • Ported existing HDF5 compression filters to R via the rhdf5filters package, benchmarked their performance on single-cell data, and researched the effectiveness of run-length encoding and bit-packing via new filters written in C.
  • Created a continuous integration workflow using GitHub Actions for the Quarto book Modern Statistics for Modern Biology. This rapidly builds chapters in parallel, deploys a new edition of the book if successful, and alerts authors via email if any problems are encountered.
  • Organised and hosted monthly Bioconductor Developer Forum and embl-R discussion sessions
  • Consulted with and mentored junior software developers, both within EMBL and the wider community, to improve their R skills and develop their own packages.

Bioinformatician

European Molecular Biology Laboratory

Heidelberg, DE

2015-19

  • Maintained and continued development of several widely used Bioconductor packages, e.g. biomaRt and rhdf5, with extensive user bases and thousands of downloads per month. This involved:
    • Modernising and strengthening the code base via code review & development of unit tests
    • Updating documentation and vignettes
    • Providing end-user support via email, online forums, and GitHub issues
  • In collaboration with experimental biologists, developed software for the analysis of pooled CRISPR-based screens
  • Developed workflows for analysis of bulk RNA-seq data, deployed on an HPC cluster
  • Created BiocWorkflowTools for publishing R Markdown documents as both Bioconductor Workflows and publications

Research Associate

Cancer Research UK Cambridge Institute

Cambridge, UK

2013-15

  • Wrote and deployed workflows for analysing structural variation data as part of the Oesophageal ICGC project
  • Developed quality control software for Oxford Nanopore sequencing data
  • Researched the impact data quality had on downstream analysis and results

Software Development Community Engagement

Bioconductor Community Advisory Board

Elected to a three-year term on the Community Advisory Board, where the aim is to engage the user and developer communities with training, outreach and a welcoming environment. As part of this I have been involved with the following:

2020-

  • New Developer Program - This program aims to encourage new developers to make the jump from scripting into package development by pairing them with more experienced mentors. As co-lead I have been responsible for designing the program, soliciting and reviewing applications from mentors and mentees, creating mentorship pairings, and checking on progress and satisfaction with the scheme.
  • Package Review Working Group - Created to review and revise the process via which packages are accepted into Bioconductor, we have systematically updated the guidelines for package authors and submissions. We have also successfully recruited a new cohort of reviewers to speed up the review process.
  • Privacy Working Group - With its large community of users and many websites and services hosted by a variety of organisations around the globe, data privacy is a serious issue for Bioconductor. We are engaged in making sure that Bioconductor services meet both legal requirements and community expectations regarding personal data privacy.

embl-R Coding Club

Co-organiser and host of EMBL’s longest-running programming group. We hold bi-weekly tutorials, package demos, talks and discussions on anything R related. I have personally taught sessions on package development, data wrangling, and parallel processing, among others, and arranged the program of speakers.

2020-21

Bioconductor Developers’ Forum

Organiser and host of the monthly developers’ forum, a series of presentations and workshops intended to bring the developer community closer together. This has included presentations by members of R Core, RStudio, rOpenSci and Microsoft.
YouTube Playlist

2019-21

Education

University of Cambridge

PhD, Computational Biology, Department of Oncology

Cambridge, UK

2009-12

Thesis: Low-level artefacts affecting microarrays and next-generation sequencing in a cancer genomics environment

Cardiff University

MSc (with Distinction) in Bioinformatics

Cardiff, UK

2007-08

Dissertation: The development of parallel processing techniques for the analysis of genome wide association studies

University of Bath

BSc (2.2) in Mathematics with Computing

Bath, UK

2003-07

Dissertation: A distributed computing approach to finding missing genes using protein threading

Teaching Experience

Advanced topics in single-cell transcriptomics

Working with on-disc data formats

Swiss Institute for Bioinformatics, Online

2020

YouTube Recording

BBSRC Advanced Methods for Reproducible Science Workshop

Introduction to R Markdown and literate programming for reproducible research

Windsor, UK

2018-20

EMBL Software Carpentry

Introduction to HPC with Slurm

Heidelberg, DE

2016-18, 2020

Statistical Data Analysis for Genome-Scale Biology (CSAMA)

A one-week intensive course teaching analysis of multi-omics studies. Variously I have taught, provided online and in-person technical support, administered the course website and teaching materials, and reviewed applications from students.

Brixen, IT

2015-19, 2022

Prizes, Awards, and Grants

CZI Funding Call - Single-cell biology

Statistical Analysis and Comprehension of the Human Cell Atlas in R / Bioconductor: Access and Scalable Infrastructure - $45,000

2018

Applied in collaboration with Wolfgang Huber

RStudio Bookdown Contest

Runner-up. Awarded for msmbstyle, a Tufte-inspired markdown theme.

2018

UseR 2011 - Best Technical Poster Prize

2011

BioC Conference 2011 Travel and Accommodation Scholarship

2011

Publications

First Author

  • Mike L. Smith, Andrzej K. Oleś, Wolfgang Huber. Authoring Bioconductor workflows with BiocWorkflowTools [version 1; referees: awaiting peer review]. F1000Research (2018)
  • Smith ML, Baggerly KA, Bengtsson H, Ritchie ME, Hansen KD. illuminaio: An open source IDAT parsing tool for Illumina microarrays. F1000Research (2013)
  • Smith ML, Dunning MJ, Tavaré S, Lynch AG. Identification and correction of previously unreported spatial phenomena using raw Illumina BeadArray data. BMC Bioinformatics (2010)
  • Smith ML, Lynch AG. BeadDataPackR: A Tool to Facilitate the Sharing of Raw Data from Illumina BeadArray Studies. Cancer Informatics (2010)

Contributing Author

  • Rozemarijn W. D. Kleinendorst, Guido Barzaghi, Mike L. Smith, Judith B. Zaugg, Arnaud R. Krebs. Genome-wide quantification of transcription factor binding at single-DNA-molecule resolution using methyl-transferase footprinting. Nature Protocols (2021)
  • Alexandros P. Drainas, Ruxandra A Lambuta, Irina Ivanova, Özdemirhan Serçin, Ioannis Sarropoulos, Mike L. Smith, Theocharis Efthymiopoulos, Benjamin Raeder, Adrian M. Stütz, Sebastian M. Waszak, Balca R. Mardin, Jan O. Korbel. Genome-wide Screens Implicate Loss of Cullin Ring Ligase 3 in Persistent Proliferation and Genome Instability in TP53-Deficient Cells. Cell Reports (2020)
  • Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, …, Hicks SC. Orchestrating single-cell analysis with Bioconductor. Nature Methods (2020)
  • Aaron T. L. Lun, Hervé Pagès, Mike L. Smith. beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types. PLOS Computational Biology (2018)
  • James H.R. Farmery, Mike L. Smith, Andy G. Lynch. Telomerecat: A ploidy-agnostic method for estimating telomere length from whole genome sequencing data. Scientific Reports (2017)
  • Weaver JM, Ross-Innes CS, Shannon N, Lynch AG, Forshew T, Barbera M, Murtaza M, Ong CA, Lao-Sirieix P, Dunning MJ, Smith L, Smith ML, Anderson CL, Carvalho B, O’Donovan M, Underwood TJ, May AP, Grehan N, Hardwick R, OCCAMS Consortium. Ordering of mutations in preinvasive disease stages of esophageal carcinogenesis. Nature Genetics (2014)
  • Ritchie ME, Dunning MJ, Smith ML, Shi W, Lynch AG. BeadArray expression analysis using Bioconductor. PLOS Computational Biology (2011)
  • Cairns J, Spyrou C, Stark R, Smith ML, Lynch AG, Tavaré S. BayesPeak - an R package for analysing ChIP-seq data. Bioinformatics (2011)
  • Moskvina V, Smith M, Ivanov D, Blackwood D, StClair D, Hultman C, Toncheva D, Gill M, Corvin A, O’Dushlaine C, Morris DW, Wray NR, Sullivan P, Pato C, Pato MT, Sklar P, Purcell S, Holmans P, O’Donovan MC, Owen MJ. Genetic differences between five European populations. Human Heredity (2010)
  • Dunning MJ, Smith ML, Ritchie ME, Tavaré S. beadarray: R classes and methods for Illumina bead-based data. Bioinformatics (2007)
  • Mark J. Dunning, Natalie P. Thorne, Isabelle Camilier, Michael L. Smith, Simon Tavaré. Quality Control and Low-Level Statistical Analysis of Illumina BeadArray. REVSTAT-Statistical Journal (2006)